Optical Character Recognition of Arabic Text

Part 2 – OCR on single-column text

Thomas Hegghammer

thomas.hegghammer@all-souls.ox.ac.uk

15 September 2025

Example documents


  • plain: single column printed text, little noise
  • plain_scan: single column printed text, more noise
  • handwriting: single column handwritten text, more noise


  • /orig: original images
  • /gt: ground truth transcriptions


  • /sources.md: provenance details

General procedure for scaling up

  1. Create vector/list with paths of images to process
  2. Prepare directory for output files
  3. Create processing function with input and output parameters
  4. Iterate
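The four steps above can be sketched as one small generic driver in Python; `process_one` here is a stand-in for whatever OCR call you use (Tesseract, Kraken, an API, etc.):

```python
import glob
import os

def batch_process(indir, outdir, process_one, in_ext=".jpg", out_ext=".txt"):
    """Apply process_one(inpath, outpath) to every matching file in indir."""
    # 1. Collect input paths
    paths = sorted(glob.glob(os.path.join(indir, "*" + in_ext)))
    # 2. Prepare the output directory
    os.makedirs(outdir, exist_ok=True)
    outpaths = []
    # 4. Iterate, deriving an output path per input
    for inpath in paths:
        stem = os.path.splitext(os.path.basename(inpath))[0]
        outpath = os.path.join(outdir, stem + out_ext)
        process_one(inpath, outpath)  # 3. the processing function
        outpaths.append(outpath)
    return outpaths
```

Every tool-specific batch example below is an instance of this pattern.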

OCR with local tools

Tesseract in R (1): Single file

# Set up tesseract
install.packages("tesseract")
library(tesseract)
engine <- tesseract("ara")

# Process
infile <- "example_docs/plain/orig/001.jpg"
text <- ocr(infile, engine)
write(text, "tess_out_plain.txt")

# Evaluate

# Install in terminal:
# pip install jiwer

## WER
command <- "jiwer -g -r example_docs/plain/gt/001.txt -h tess_out_plain.txt"
as.numeric(system(command, intern = TRUE))

## CER
command <- "jiwer -g -c -r example_docs/plain/gt/001.txt -h tess_out_plain.txt"
as.numeric(system(command, intern = TRUE))

Tesseract in R (2): Any number of files

# Store image paths in vectors
plain <- list.files("example_docs/plain/orig",
  full.names = TRUE
)

# Create variable for the output directory
tess_plaindir <- "tesseract/out/plain"

# Create it
dir.create(tess_plaindir, recursive = TRUE)
library(stringr)

# Create function
ocr_tess <- function(input, output) {
  text <- ocr(input, engine)
  write(text, output)
}

# Loop over vector
for (i in seq_along(plain)) {
  inpath <- plain[i]
  outfile <- str_replace(
    basename(inpath), "\\.jpg$", ".txt"
  )
  outpath <- file.path(tess_plaindir, outfile)
  ocr_tess(inpath, outpath)
}
# Now check "tesseract/out/plain" directory

Tesseract in Python (1): Single file

# pip install pytesseract pillow
import pytesseract
from PIL import Image
import glob
import os

# Load the image
infile = "example_docs/plain/orig/001.jpg"
image = Image.open(infile)

# Process
text = pytesseract.image_to_string(image, lang='ara')

# Write to file
with open("tess_python_out_plain.txt", "w", encoding="utf-8") as f:
    f.write(text)

Evaluate

import jiwer
with open("example_docs/plain/gt/001.txt", "r") as f:
    ref = f.read()
with open("tess_python_out_plain.txt", "r") as f:
    hyp = f.read()

jiwer.wer(ref, hyp)
jiwer.cer(ref, hyp)
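For intuition: WER and CER are Levenshtein (edit) distances normalized by the length of the reference, computed over words and characters respectively. A minimal illustration of what jiwer computes (jiwer itself adds text normalization and edge-case handling, so use it rather than this sketch in practice):

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits over reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: character-level edits over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```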

Tesseract in Python (2): Any number of files

# Store image paths in list
plain = glob.glob("example_docs/plain/orig/*")

# Create variable for the output directory
tess_plaindir = "tesseract/out/plain"

# Create it
os.makedirs(tess_plaindir, exist_ok=True)

# Create function
def ocr_tess(input_path, output_path):
    image = Image.open(input_path)
    text = pytesseract.image_to_string(image, lang='ara')
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(text)

# Loop over list
for inpath in plain:
    outfile = os.path.splitext(os.path.basename(inpath))[0] + ".txt"
    outpath = os.path.join(tess_plaindir, outfile)
    ocr_tess(inpath, outpath)
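The same loop extends to the other example sets. A small helper can pair input and output directories per set (directory names follow the layout shown at the start):

```python
import os

def set_dirs(names, in_root="example_docs", out_root="tesseract/out"):
    """Return (input_dir, output_dir) pairs for each example set."""
    return [(os.path.join(in_root, name, "orig"),
             os.path.join(out_root, name)) for name in names]

# for indir, outdir in set_dirs(["plain", "plain_scan", "handwriting"]):
#     os.makedirs(outdir, exist_ok=True)
#     ...  # same per-file loop as above, reading from indir, writing to outdir
```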

Kraken

On command line:

# Install Kraken and dependencies
sudo apt update
sudo apt install -y pipx software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
sudo apt install -y python3.11
sudo pipx install --python python3.11 kraken

# Download model
kraken get 10.5281/zenodo.7050296 #arabic_best.mlmodel

Kraken in R

library(stringr)
infile <- "example_docs/plain/orig/001.jpg"
outfile <- "kraken_out_plain.txt"
command <- str_glue("kraken -i '{infile}' '{outfile}' binarize segment -bl ocr -m arabic_best.mlmodel")
system(command, intern = TRUE)
# Slow without GPU

# Evaluate
## WER
command <- "jiwer -g -r example_docs/plain/gt/001.txt -h kraken_out_plain.txt"
as.numeric(system(command, intern = TRUE))

## CER
command <- "jiwer -g -c -r example_docs/plain/gt/001.txt -h kraken_out_plain.txt"
as.numeric(system(command, intern = TRUE))

Kraken in Python

import subprocess

infile = "example_docs/plain/orig/001.jpg"
outfile = "kraken_out_plain.txt"

command = [
    "kraken",
    "-i", infile, outfile,
    "binarize", "segment", "-bl", "ocr",
    "-m", "arabic_best.mlmodel"
]

# Run and capture stdout
result = subprocess.run(command, capture_output=True, text=True)
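Scaling Kraken up follows the same pattern as before: build the command per file and loop. A sketch, with the flags copied from the single-file call above:

```python
import subprocess

def kraken_cmd(infile, outfile, model="arabic_best.mlmodel"):
    """Assemble the kraken CLI call used above for one input/output pair."""
    return ["kraken", "-i", infile, outfile,
            "binarize", "segment", "-bl", "ocr", "-m", model]

# for inpath in plain:  # the list of image paths from earlier
#     outfile = os.path.splitext(os.path.basename(inpath))[0] + ".txt"
#     subprocess.run(kraken_cmd(inpath, outfile), capture_output=True, text=True)
```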

Surya in R

Install with

# Run in terminal
pip install surya-ocr

# Then in R
library(stringr)
outdir <- "surya/out/plain"
dir.create(outdir, recursive = TRUE)
infile <- "example_docs/plain/orig/001.jpg"
command <- str_glue("surya_ocr {infile} --output_dir {outdir}")
system(command, intern = TRUE)
# command returns no text; output ends up in outdir

Surya in Python

import subprocess

infile = "example_docs/plain/orig/001.jpg"
outdir = "surya/out/plain"
command = ["surya_ocr", infile, "--output_dir", outdir]
result = subprocess.run(command, capture_output=True, text=True)
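Surya writes its results as JSON files in the output directory rather than printing text. The exact layout varies by version, but assuming the common shape — a dict keyed by image name, with pages each containing `text_lines` — the recognized text can be pulled out like this (a sketch; inspect the JSON your Surya version actually produces):

```python
def surya_text(results):
    """Join recognized lines from a parsed Surya results dict (assumed layout)."""
    lines = []
    for pages in results.values():
        for page in pages:
            for tl in page.get("text_lines", []):
                lines.append(tl["text"])
    return "\n".join(lines)

# import json
# # the path to results.json inside outdir depends on your Surya version
# with open("surya/out/plain/001/results.json") as f:
#     print(surya_text(json.load(f)))
```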

OCR with API

API credentials management (1)

  • Create a file called /home/vscode/.Renviron
  • Place API keys in it like this:
GCS_AUTH_FILE=/home/vscode/keys/my_gcs_auth_key.json
DAI_PROCESSOR_ID=1234567890
MISTRAL_API_KEY=1234567890
  • Replace the placeholder values with your own (.Renviron does not support trailing comments)
  • In R, they will be registered automatically
  • In Python, use dotenv package
# pip install python-dotenv
import os
from dotenv import load_dotenv
load_dotenv(dotenv_path="/home/vscode/.Renviron")
# To store specific env vars in objects:
api_key = os.getenv("MISTRAL_API_KEY")

About the Google Cloud Services keyfile:

  • Rename file to my_gcs_auth_key.json
  • Upload file to Codespace manually (rightclick file explorer, choose upload)
  • Then run this in Codespace terminal:
mkdir /home/vscode/keys
mv my_gcs_auth_key.json /home/vscode/keys

API credentials management (2)

In the home folder of the Codespace, create a file called .gitignore

Add the following to it:

# Don't commit sensitive stuff
keys/
.Renviron

Google Document AI (R)

install.packages("daiR")
library(daiR)

infile <- "example_docs/plain/orig/001.jpg"
outfile <- "dai_out_plain.txt"
resp <- dai_sync(infile, proc_v = "rc")
text <- get_text(resp)
write(text, outfile)

# Evaluate
## WER
command <- "jiwer -g -r example_docs/plain/gt/001.txt -h dai_out_plain.txt"
as.numeric(system(command, intern = TRUE))

## CER
command <- "jiwer -g -c -r example_docs/plain/gt/001.txt -h dai_out_plain.txt"
as.numeric(system(command, intern = TRUE))

Google Document AI (Python)

# pip install google-cloud-documentai

import os
from google.cloud import documentai_v1 as documentai

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "PATH_TO_KEYFILE"  # change
project_id = "YOUR_PROJECT"
location = "YOUR_REGION"
processor_id = "YOUR_PROCESSOR_ID"

def process_dai(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str = "application/pdf",
):
    client_options = {"api_endpoint": f"{location}-documentai.googleapis.com"}
    client = documentai.DocumentProcessorServiceClient(client_options=client_options)
    name = client.processor_path(project_id, location, processor_id)

    # Read file into memory
    with open(file_path, "rb") as f:
        file_content = f.read()
    raw_document = documentai.RawDocument(content=file_content, mime_type=mime_type)
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)
    document = result.document
    return document

infile = "example_docs/plain/orig/001.jpg"
outfile = "dai_python_out.txt"

doc = process_dai(project_id, location, processor_id, file_path=infile, mime_type="image/jpeg")

with open(outfile, "w", encoding="utf-8") as f:
    f.write(doc.text)
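process_dai defaults to application/pdf, but the example files are JPEGs. When batching over mixed inputs, the standard-library mimetypes module can pick the right type per file:

```python
import mimetypes

def guess_mime(path, default="application/pdf"):
    """Guess a MIME type from the file extension, falling back to PDF."""
    mime, _ = mimetypes.guess_type(path)
    return mime or default

# doc = process_dai(project_id, location, processor_id,
#                   file_path=infile, mime_type=guess_mime(infile))
```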

Mistral OCR in Python

# pip install mistralai
import base64
import os
from mistralai import Mistral
from dotenv import load_dotenv
load_dotenv(dotenv_path="/home/vscode/.Renviron")

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

infile = "example_docs/plain/orig/001.jpg"

# Getting the base64 string
base64_image = encode_image(infile)

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "image_url",
        "image_url": f"data:image/jpeg;base64,{base64_image}" 
    },
    include_image_base64=True
)

text = response.pages[0].markdown

with open("mistralocr_out_plain.txt", "w", encoding="utf-8") as f:
    f.write(text)
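The data URL above hard-codes image/jpeg. For batching over mixed image formats, a small helper can assemble the URL with a guessed MIME type (same base64 encoding as encode_image above):

```python
import base64
import mimetypes

def data_url(path, default_mime="image/jpeg"):
    """Encode an image file as a data: URL with a guessed MIME type."""
    mime = mimetypes.guess_type(path)[0] or default_mime
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# document={"type": "image_url", "image_url": data_url(infile)}
```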

Evaluate

import jiwer
with open("example_docs/plain/gt/001.txt", "r") as f:
    ref = f.read()
with open("mistralocr_out_plain.txt", "r") as f:
    hyp = f.read()

jiwer.wer(ref, hyp)
jiwer.cer(ref, hyp)

Postprocessing with LLMs

Prompt

  • See postprocessing_prompt.md in repo.

  • Edit to your needs

  • Note the “Historical context for the document” section. Either tailor it to the document or remove it

Postprocess with Mistral in R

install.packages("ellmer")
install.packages("tidyverse")
library(ellmer)
library(tidyverse)

# Activate
mistral <- chat_mistral()
# uses mistral-large by default

# Get text from files
postprocess_prompt <- read_file("postprocess_prompt.md")
text_to_clean <- read_file("mistralocr_out_plain.txt")

# Build prompt
prompt <- paste(postprocess_prompt, text_to_clean)

# Make call
cleaned_text <- mistral$chat(prompt)

# Save
write(cleaned_text, "mistralocr_out_plain_cleaned.txt")

# Evaluate
## WER
command <- "jiwer -g -r example_docs/plain/gt/001.txt -h mistralocr_out_plain_cleaned.txt"
as.numeric(system(command, intern = TRUE))

## CER
command <- "jiwer -g -c -r example_docs/plain/gt/001.txt -h mistralocr_out_plain_cleaned.txt"
as.numeric(system(command, intern = TRUE))

Postprocess with Mistral in Python

# pip install mistralai
import os
from mistralai import Mistral
from dotenv import load_dotenv
load_dotenv(dotenv_path="/home/vscode/.Renviron")

api_key = os.getenv("MISTRAL_API_KEY")  # or put it here as string
model = "mistral-large-latest"

# Activate client
client = Mistral(api_key=api_key)

# Get text from files
with open("postprocess_prompt.md", "r", encoding="utf-8") as f:
    postprocess_prompt = f.read()
with open("tess_python_out_plain.txt", "r", encoding="utf-8") as f:
    text_to_clean = f.read()

# Build prompt
prompt = postprocess_prompt + "\n" + text_to_clean

# Make call
response = client.chat.complete(
    model=model,
    messages=[{"role": "user", "content": prompt}]
)

text = response.choices[0].message.content
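To mirror the reusable setup of the earlier batch examples, the call can be wrapped in a function (a sketch; client is the Mistral client created above, and the prompt/text arguments are whatever strings you read from file):

```python
def postprocess(client, model, prompt_text, raw_text):
    """Send prompt + OCR output to a chat model and return the cleaned text."""
    response = client.chat.complete(
        model=model,
        messages=[{"role": "user", "content": prompt_text + "\n" + raw_text}],
    )
    return response.choices[0].message.content
```

Saving the returned text to a file and scoring it with jiwer then works exactly as in the Mistral OCR example above.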

Key takeaways

  • Programmatic OCR allows scaling, choice, and modularity
  • Character recognition and layout parsing are separate problems
  • Scale up by 1) creating a vector/list of image paths, 2) creating a custom function, and 3) iterating over the vector/list with it
  • API services (Google Document AI, Mistral OCR) are generally more powerful out of the box, but Kraken and Surya are not far behind
  • Postprocessing can close the gap between open-source and proprietary tools
  • It’s worth learning how to set up and use APIs